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Abstract. We present a comparison of several modern C++ libraries providing high-level in- 
terfaces for programming multi- and many-core architectures on top of CUDA or OpenCL. The 
comparison focuses on the solution of ordinary differential equations and is based on odeint, a frame- 
work for the solution of systems of ordinary differential equations. Odeint is designed in a very flexible 
way and may be easily adapted for effective use of libraries such as Thrust, MTL4, VexCL, or Vi- 
ennaCL, using CUDA or OpenCL technologies. We found that CUDA and OpenCL work equally 
well for problems of large sizes, while OpenCL has higher overhead for smaller problems. Further- 
more, we show that modern high-level libraries allow to effectively use the computational resources 
of many-core CPUs or multi-core CPUs without much knowledge of the underlying technologies. 
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1. Introduction. Recently, general purpose computing on graphics processing 
units (GPGPU) has acquired considerable momentum in the scientific community. 
This is confirmed both by increasing numbers of GPGPU-related publications and 
GPU-based supercomputers in the TOP500 1 list. Major programming frameworks 
are NVIDIA CUDA and OpenCL. The former is a proprietary parallel computing ar- 
chitecture developed by NVIDIA for general purpose computing on NVIDIA graphics 
adapters, and the latter is an open, royalty-free standard for cross-platform, paral- 
lel programming of modern processors and GPUs maintained by the Khronos group. 
By nature, the two frameworks have their distinctive pros and cons. CUDA has a 
more mature programming environment with a larger set of scientific libraries, but 
is available for NVIDIA hardware only. OpenCL is supported on a wide range of 
hardware, but its native API requires a much larger amount of boilerplate code from 
the developer. 

Both technologies are able to provide scientists with the vast computational re- 
sources of modern GPUs at the price of a steep learning curve. Programmers need 
to familiarize themselves with a new programming language and, more importantly, 
with a new programming paradigm. However, the entry barrier may be lowered with 
the help of specialized libraries. The CUDA Toolkit includes several such libraries 
(BLAS implementations, Fast Fourier Transform, Thrust and others). OpenCL lacks 
standard libraries, but there are a number of third-party projects aimed at developing 
both CUDA and OpenCL programs. 

This paper presents a comparison of several modern C++ libraries aimed at ease 
of GPGPU development. We look at both convenience and performance of the li- 
braries under consideration in the context of solving ordinary differential equations. 
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The comparison is based on odeint 2 , a modern C++ library for solving ordinary dif- 
ferential equations (ODEs) numerically [2, 3] that has been accepted for inclusion 
into the Boost libraries 3 recently. It is developed in a generic way using template 
meta-programming techniques, which lead to extraordinary high flexibility at utmost 
performance. The numerical algorithms are implemented independently of the un- 
derlying arithmetics. This results in a broad applicability of the library, especially 
in non-standard environments. For example, odeint supports matrix types, arbitrary 
precision arithmetics, and can be easily adopted to use either CUD A or OpenCL 
frameworks. The GPGPU libraries considered in this work are Thrust, MTL4, VexCL, 
and ViennaCL: 

Thrust is a parallel algorithms library 4 which resembles the C++ Standard Template 
Library [ ]. Thrust's high-level interface greatly enhances developer produc- 
tivity while enabling performance portability between GPUs and multi-core 
CPUs. Thrust is distributed with the NVIDIA CUDA Toolkit since ver- 
sion 4.1. 

MTL4 (The Matrix Template Library) 5 is a C++ linear algebra library that provides 
an intuitive interface by establishing a domain-specific language embedded 
in C++ [13]. The library aims for maximal performance achievable by high- 
level languages using compile-time transformations. Currently, three versions 
exist: the open-source edition that supports single- and multi-core CPUs, the 
supercomputing edition providing generic MPI-based parallelism, and the 
CUDA edition that is introduced in this paper. 

VexCL is a vector expression template library 6 for OpenCL [12]. It has been created 
for ease of OpenCL development with C++. VexCL strives to reduce the 
amount of boilerplate code needed to develop OpenCL applications. The 
library provides a convenient and intuitive notation for vector arithmetic, 
reduction, and sparse matrix-vector multiplication. Multi-device and even 
multi-platform computations are supported. 

ViennaCL (The Vienna Computing Library) is a scientific computing library 7 writ- 
ten in C++ [32]. CUDA and OpenMP compute backends were added recently, 
but only the initial OpenCL backend is considered in the remainder of this 
work. The programming interface is compatible with Boost. uBLAS 8 and al- 
lows for simple, high-level access to the vast computing resources available on 
parallel architectures such as GPUs. The library's primary focus is on com- 
mon linear algebra operations (BLAS levels 1, 2 and 3) and the solution of 
large sparse systems of equations by means of iterative methods with optional 
preconditioners. 

CUDA and OpenCL differ in their handling of compute kernels compilation. In 
NVIDIA's framework the compute kernels are compiled to PTX code together with the 
host program. PTX is a pseudo-assembler language which is compiled at runtime for 
the specific NVIDIA device the kernel is launched on. Since PTX is already very low- 
level, this just-in-time kernel compilation has low overhead. In OpenCL the compute 
kernels are compiled at runtime from higher-level C-like sources, adding an overhead 
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which is particularly noticeable for smaller sized problems. A pre-compilation to some 
low-level pseudo-code as in CUDA is not feasible in OpenCL because of hardware 
agnosticism by design. 

The approach taken for the generation and compilation of the compute kernels 
is one of the main differences between the OpenCL libraries we considered. VexCL 
generates and compiles an OpenCL program with a single kernel for each vector 
expression it encounters. This leads to potentially higher initialization overhead, but 
should prove to be more effective in long runs. On the other hand, ViennaCL uses a 
set of predefined kernels, which functionally overlaps with BLAS level f routines for 
vector operations. These kernels are compiled in batch at the program start to allow 
for faster initialization. However, due to this design decision, vector expressions with 
several operands may result in the launch of more than one kernel. It should be noted 
that because of the main focus of ViennaCL being on iterative solvers for large sparse 
systems of equations, where complex vector expressions are rare, predefined kernels 
are favorable in such a setting. The implications of these two distinct design decisions 
in VexCL and ViennaCL will be further addressed in the discussion of the results. 

The other difference between CUDA and OpenCL is that CUDA supports a sub- 
set of the C++ language in compute kernels, while OpenCL kernels are written in a 
subset of C99. Therefore, CUDA programmers may use template meta-programming 
techniques which may lead to more efficient and compact code. The native OpenCL 
API does not provide such features, but the drawback is balanced by the ability of 
kernel source generation during runtime. Modern C++ libraries such as those con- 
sidered in this work successfully use this approach and hide gory details from their 
users. 

2. Adapting odeint. Ordinary differential equations play a major role in many 
scientific disciplines. They occur naturally in the context of mechanical systems, 
like granular [30] and molecular dynamics. In fact, the Newtonian and Hamiltonian 
mechanics are formulated as ODEs [23]. Many other applications can be found in 
such diverse fields as biology [9, 27] and neuroscience [19], chemistry [4], and social 
sciences [17]. Furthermore, ODEs are also encountered in the context of the numerical 
solution of non-stationary partial differential equations (PDEs), where they occur after 
a discretization of the spatial coordinates [18]. 

Odeint solves the initial value problem (IVP) of ordinary differential equations 
given by 

(2.1) ^L=x = f(x,t), x(0)=x . 

Here, x is the dependent variable and is usually a vector of real or complex values. 
t is the independent variable. We will refer to t as the time throughout the article 
and denote the time derivative with dx/dt = x. f(x,t) is the system function and 
defines the ODE. 

Typical use cases for solving ODEs on GPUs are large systems of coupled ODEs 
which occur as discretizations of PDEs, or ODEs defined on lattices or graphs. An- 
other use case are parameter studies, where the dependence of an ODE on some 
parameters is of interest. Here, a high-dimensional ODE consisting of many low- 
dimensional uncoupled ODEs, each with a different parameter set, is considered. This 
one large system is then solved at once, hence all low-dimensional ODEs are solved 
simultaneously. 

Numerous methods for solving ODEs exist [14, 15, 31], which are usually catego- 
rized in the field of numerical analysis. Odeint implements the most prominent of these 
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methods, for example the classical Runge-Kutta methods and Runge-Kutta-Fehlberg 
methods, multi-step methods (Adams-Bashforth-Moulton) , symplectic Runge-Kutta- 
Nystrom methods, and implicit methods (Roscnbrock and implicit Euler). All of these 
methods work iteratively, starting from a given initial value x(to) to calculate the next 
value x(t + At). At is the step size and may be chosen either statically or adaptively. 
For reference, we note that the simplest method is the explicit Euler scheme 

(2.2) x(t + At)=x(t) + Atf{x{t),t). 

Its global accuracy is of first order, but the scheme is usually not used for real appli- 
cations because of stability and accuracy issues. 

One main feature of odcint is the decoupling of the specific algorithm for solving 
the ODE from the underlying arithmetic operations. This is achieved by a combina- 
tion of a state type, an algebra, and operations. The state type represents the state of 
the ODE being solved and is usually a vector type like std :: vectoro, std:: array <>, 
or a vector residing on a GPU. The algebra is responsible for iterating through all 
elements of the state, whereas the operations are responsible for the elementary op- 
erations. 

To see how the explicit Euler method (2.2) is translated to code in odeint, we 
briefly discuss its implementation: 

template< class State, class Algebra, class Operations > 
class euler { 

// 

template< class Ode > 

void do_step(Ode ode, State &x, time_type t, time_type dt) { 
odc(x, m_dxdt, dt); 

Algebra:: for _each3( x, x, m_dxdt, Operations::scale_sum2(1.0, dt) ); 

} 

h 

The state type, the algebra, and the operations enter the Euler method as template 
parameters, hence they are exchangeable. The function object ode represents the 
ODE and must be provided by the user. It calculates the right hand side f(x,t) of 
(2.1) and stores the result in m_dxdt. The call of for_each3 iterates simultaneously over 
all elements of three vectors and applies on each triple scale_sum2. The operation is 
performed in-place, meaning that x is updated to the new value. In the code-snippet 
above, the call to for_each3 is thus equivalent to the vector operation 

x = 1.0 * x + dt * m_dxdt 

which is just Eq. (2.2), since m_dxdt holds the values of f(x(t),t). 

An odeint algebra is a class consisting of for_eachl, . . . , for_eachN methods. For 
example, the for_each3 method in the range_algebra — the default algebra for most 
vector types — is similar to 

struct range_algebra { 

// - 

template< class Rl, class R2, class R3, class Op > 

void for_cach3(Rl &rl, R2 &r2, R3 &r3, Op op) { 
auto itl = boost::begin(rl); 
auto it2 = boost::begin(r2); 
auto it3 = boost::begin(r3); 
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while( itl != boost ::end(rl) ) op(*itl++, *it2++; *it3++); 

} 

// 

h 

The operations are represented by a struct with public member classes defining the 
operations used by the algebras. There is only one default operations class implemen- 
tation in odeint, which uses the arithmetic operators as usual: 

struct default_operations { 

// 

template< class Fact, class Fac2 > 
struct scale_sum2 { 

Fact m_facl; 

Fac2 m_fac2; 

scale_sum2( Fact fact, Fac2 fac2 ) : m_facl(facl), m_fac2(fac2) { } 
template< class SI, class S2, class S3 > 

void operator()( SI &sl, const S2 &s2, const S3 &s3 ) const { 

si = m_facl * s2 + m _fac2 * s3; 

} 

}; 

// - 

h 

The main reason for the separation of algebra and operations is that all arith- 
metic calculations and iterations are completely encapsulated into the algebra and the 
operations. Therefore, the numerical algorithms to solve the ODEs are independent 
from the underlying arithmetics. Note that the algebra and the operations must be 
chosen such that they interact correctly with the state type. 

Many libraries for vector and matrix types provide expression templates [35, 36, 
37] for the elementary operations using operator overload convenience. Such libraries 
do not need to define their own algebra, but can instead be used with a default 
algebra and a default operation set included in odeint, which simply call the operations 
directly on the matrix or vector type. 

We describe the adaptation of odeint for the GPGPU libraries under consideration 
in the following. The adaptions are now part of odeint, thus native support for 
these libraries is available. Implementation details such as the resizing of vectors is 
accomplished in a straight-forward manner and not further addressed for the sake of 
conciseness. 

To adapt Thrust to odeint, we need to provide both an algebra and operations. 
The algebra needs to define the for_each family of algorithms. All of these operations 
follow the same pattern, so we consider for_each3 only: 

struct thrust_algebra { 

template<class StateTypel, class StateType2, class StateType3, class Op> 
static void for_each3(StateTypel &sl, StateType2 &s2, StateType3 &s3, Op op) { 
thrust :: for_each( 

thrust :: make_zip ^iterator ( thrust :: make_tuple( 

si. begin (), s2. begin (), s3.begin() ) ), 
thrust :: make_zip .iterator ( thrust :: make_tuple( 
sl.end(), s2.end(), s3.end() ) ), 

op); 

} 
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h 

Here, thrust :: make_zip_iterator is used in combination with make_tuple to pack several 
device vector iterators into a single iterable sequence. The sequence is then processed 
by the thrust :: for_each algorithm, applying the function object op to each entry. 

The operations called via the function object op are defined in thrust_operations 
and are actually function objects executed on the respective CUDA device: 

struct thrust_operations { 

template<class Fact = double, class Fac2 = Facl> 
struct scale_sum2 { 

const Fact m_alphal; 

const Fac2 m_alpha2; 

scale_sum3 (const Fact alphal, const Fac2 alpha2) 
: m_alphal(alphal), m_alpha2(alpha2) { } 

template< class Tuple > 

—host— —device— void operator ()( Tuple t ) const { 
thrust :: get<0>(t) = m_alphal * thrust::get<l>(t) 
+ m_alpha2 * thrust::get<2>(t); 

} 

}; 

h 

The device function object uses thrust :: geto functions to unpack the zip iterator 
into separate values. This approach is heavily used with Thrust and allows to process 
several vectors in a single efficient sweep. 

MTL4, VexCL, and ViennaCL libraries provide convenient expression templates 
that may be directly used with odeint's vector_space_algebra and default -operations. 
This combination proved to be effective with MTL4 and VexCL, where each expres- 
sion results in a single kernel. For ViennaCL, however, default operations involving 
more than two terms result in multiple kernel launches. Moreover, temporary vectors 
are allocated and deallocated for each of such composite operations, resulting in a 
dramatic decrease of performance. To address such problems, ViennaCL provides a 
kernel generator [34], which is able to generate specialized operations for ViennaCL. 
These use a custom kernel generation mechanism provided by the library. For exam- 
ple, the scale_sum2 operation is defined as: 

struct viennacLoperations { 

template<class Facl = double, class Fac2 = Facl> 
struct scale_sum2 { 

// 

template<class Tl, class T2, class T3> 
void operator()( viennacl::vector<Tl> &vl, 

const viennacl:: vector <T2> &v2, 

const viennacl:: vector <T3> &v3) const 

{ 

using namespace viennacl; 
// instantiate symbolic types: 

static generator :: symbolic_vector <0, Tl> sym_vl; 
static generator :: symbolic_vector <1, T2> sym_v2; 
static generator :: symbolic_vector <2, T3> sym_v3; 
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static generator :: cpu_symbolic_scalar<3, Facl> sym_al; 
static generator :: cpu_symbolic_scalar<4, Fac2> sym_a2; 

// run the kernel generation at first call of this function: 
static generator :: custom_operation op( 

sym_vl = sym_al * sym_v2 + sym_a2 * sym_v3, 

"scale_sum2"); 

// launch the kernel: 

ocl :: enqueue( op(vl, v2, v3, m_alphal, m_alpha2) ); 

} 

}; 

h 

Here, a custom OpenCL kernel is automatically generated from symbolic vector ex- 
pression in the first call of the operatorQ and then reused for all subsequent calls. 

3. Numerical Experiments. As shown in the previous section, all three GP- 
GPU libraries considered in our comparison could be adapted to odeint without get- 
ting in contact with low-level CUDA or OpenCL code. The purpose of this section is 
to evaluate the performance of the GPGPU libraries and whether there is a price to 
pay for the high-level interface. 

3.1. Lorenz Attractor Ensemble. In the first example we consider the Lorenz 
system [25]. The Lorenz system is a system of three coupled ODEs which shows 
chaotic behavior for a large range of parameters. It is one of the most frequently used 
ODEs for evaluation purposes in the nonlinear dynamics community. The equations 
for the Lorenz system read 

(3.1) x = —a (x — y) , y = Rx — y — xz, z = —bz + xy. 

Solutions of the Lorenz system usually furnish very interesting behavior in depen- 
dence on one of its parameters. For example, one might want to study the chaoticity 
in dependence on the parameter R. Therefore, one would create a large set of Lorenz 
systems (each with a different parameter R), pack them all into one system and 
solve them simultaneously. In a real study of chaoticity one may also calculate the 
Lyapunov exponents [28], which requires to solve the Lorenz system and their linear 
perturbations. 

The Thrust version of the system function object for the Lorenz attractor ensem- 
ble example is presented below. It holds model parameters and provides the necessary 
operatorQ with a signature required by the odeint library. The state type is repre- 
sented by thrust :: device_vector<double>: 

typedef thrust: :device_vector<double> state_type; 

struct lorenz_system { 
size_t N; 

const state_type &R; 

lorenz_system( size_t n, const state_type &r) : N(n), R(r) { } 

void operator () (const state_type &x, state_type &dxdt, double t) const; 

h 

The X, Y , and Z components of the state are held in the continuous partitions of 
the vector. operatorQ uses the standard technique of packing the state components 
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into a zip iterator and passes the composite sequence to the thrust :: for.each algorithm 
together with the provided device function object: 

struct lorenz_functor ; 

void lorenz_system::operator() (const state_type &x, state_type &dxdt, double t) const 
{ 

thrust :: for_each( 

thrust :: make_zip .iterator ( thrust :: make_tuple( 
R.begin(), 

x.beginQ, x.bcgin() + N, x.begin() + 2 * N, 
dxdt.begin(), dxdt.begin() + N, dxdt.begin() + 2 * N ) ), 
thrust :: make_zip .iterator ( thrust :: make_tuple( 
R.end(), 

x.beginQ + N, x.beginQ + 2 * N, x.endQ, 
dxdt.beginQ + N, dxdt.beginQ + 2 * N, dxdt.endQ ) ), 
lorenz_functor () ); 

} 

The device function object unpacks the individual components and applies the re- 
quired operations to the derivative part, essentially leading to a one-to-one translation 
of (3.1) into code: 

struct lorenz_functor { 
template< class T > 

__host__ __device__ void operator ()( T t ) const { 
double R = thrust::get<0>(t); 
double x = thrust::get<l>(t); 
double y = thrust::get<2>(t); 
double z = thrust::get<3>(t); 
thrust :: get<4>(t) = sigma * ( y — x ); 
thrust :: get<5>(t) = R * x — y — x*z; 
thrust :: get<6>(t) = — b * z + x * y ; 

} 

h 

The system function object for the MTL4 version of the Lorenz attractor exam- 
ple is more compact than the Thrust variant because MTL4 supports a rich set of 
vector expressions. MTL4 provides the type multLvector that allows for expressing 
the operations directly: 

typedef mtl::dense_vector<double> vector -type; 
typedef mtl::multi_vector< vector _type> state_type; 

struct lorenz_system { 
const vector_type &R; 

explicit sys_func (const vector -type &R) : R(R) { } 

void operator Q (const state_type& x, state_type& dxdt, double t) { 
dxdt.at(O) = sigma * (x.at(l) — x.at(O)); 
dxdt.at(l) = R * x.at(O) - x.at(l) - x.at(O) * x.at(2); 
dxdt.at(2) = x.at(O) * x.at(l) - b * x.at(2); 

} 

}; 
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This class was originally designed for building a generic matrix as a set of column 
vectors and expressing all matrix operations in terms of underlying vectors' operations. 
In this sense, all operations are performed on the sub- vectors of x and dxdt where each 
vector operation is a kernel call. 

There is potential for optimization when multi- vectors are consecutively allocated 
in memory so that in our case three kernel calls can be replaced by a single call. Espe- 
cially for small problem sizes (e.g. 4096 entries), the kernel call overhead is dominant 
and one large vector operation is up to 150 % faster than three smaller ones whereas 
the difference is less than about 5 % when the vectors are larger than one million 
entries. Likewise, the evaluation of the Lorenz system could be performed with one 
kernel using MTL4's fused expressions. However, both techniques require subtle meta- 
programming treatment on the type level and are not yet available for CUDA version 
of MTL4 (we will refer to the CUDA version as CMTL4). 

The VexCL implementation of the Lorenz attractor ensemble example looks as 
compact as MTL4's. Here, the state is represented by the vex::multivector<double,3> 
type, which holds three instances of vex:: vector<double> and transparently dis- 
patches all operations to the underlying components. The code for operatorQ body 
practically coincides with the problem statement (3.1): 

typedef vcx::multivector<double, 3> state_type; 

struct lorenz_system { 

const vex::vector<double> &R; 

lorenz_system(const vex::vector<double> &r) : R(r) {} 

void operator () (const state_type &x, state_type &dxdt, double t) const { 

dxdt(0) = sigma * (x(l) - x(0)); 
dxdt(l) = R * x(0) - x(l) - x(0) * x(2); 
dxdt(2) = x(0) * x(l) - b * x(2); 

} 

h 

However, the drawback of this variant is that it leads to three kernel launches, 
namely one per each vector assignment. As we have shown in the above discussion 
of the MTL4 variant, this results in suboptimal performance. A direct use of multi- 
vectors arithmetic operations is not possible due to mixed components in the right 
hand side expressions. These additional kernel launches can be eliminated in VexCL 
by assigning a tuple of expressions to a multi-vector. The required notation is only 
slightly more cumbersome than the above variant: 

dxdt = std::tie( sigma * (x(l) — x(0)), 

R * x(0) - x(l) - x(0) * x(2), 
x(0) * x(l) - b * x(2) ); 

The performance boost of these fused expressions is a bit larger (25 % for large sys- 
tems) compared to the performance boost for the CMTL. The reason might be the 
larger launch overhead for OpenCL kernels. 

For the ViennaCL version of the Lorenz attractor example a boost:: fusion :: vector 
is used to pack the coordinate components of the state vector into a single type. 
Individual components are instances of the viennacl :: vector<double> type. The Vi- 
ennaCL kernel generation facility described in Sec. 2 is then used to avoid multiple 
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kernel launches. The symbolic variables and the custom operation are defined as static 
variables, so the corresponding OpenCL kernel is compiled only once in the first call 
of the system function object: 

typedef fusion: : vector < 

viennacl :: vector<double>, viennacl::vector<double>, viennacl::vector<double> 

> state_type; 

struct lorenz_system { 

const viennacl:: vector<double> &R; 

lorenz_system(const viennacl::vector<double> &r) : R(r) {} 

void operator () (const state_type &x, stateJype &dxdt, double t) const { 
using namespace viennacl: generator; 

// ... Definition of static symbolic values ... 

static custom_operation lorenz_op( 

sym_dX = sym_sigma * (sym_Y — sym_X), 
sym_dY = element_prod(sym_R, sym_X) — sym_Y 

— element_prod(sym_X, sym_Z), 
sym_dZ = element_prod(sym_X, sym_Y) — sym_b * sym_Z, 
"lorenz"); 

const auto &X = fusion: :at_c<0>(x); 
const auto &Y = fusion: :at_c<l>(x); 
const auto &Z = fusion: :at_c<2>(x); 

auto &dX = fusion: :at_c<0>(dxdt); 
auto &dY = fusion::at_c<l>(dxdt); 
auto &dZ = fusion::at_c<2>(dxdt); 

viennacl:: ocl :: enqueue(lorenz_op(dX, dY, dZ, X, Y, Z, R, sigma, b)); 

} 

}; 



3.2. Chain of Coupled Phase Oscillators. As a second example we consider 
a chain of coupled phase oscillators. A phase oscillator describes the dynamics of an 
autonomous oscillator [ ]. Its evolution is governed by the phase tp, which is a 2tt- 
periodic variable growing linearly in time, i.e. (p = u>, where ui is the phase velocity. 
The amplitude of the oscillator does not occur in this equation, so interesting behavior 
can only be observed if many of such oscillators are coupled. In fact, such a system 
can be used to study such divergent phenomena as synchronization, wave and pattern 
formation, phase chaos, or oscillation death [22, 29]. It is a prominent example of an 
emergent system where the coupled system shows a more complex behavior than its 
constitutes. 

The concrete example we analyze here is a chain of nearest-neighbor coupled 
phase oscillators [ ]: 

(3.2) ^ = uii + sm(ip l+1 - ipi) + sm(ipi - ipi-x). 

The index i denotes here the i-th phase in the chain. Note, that the phase velocity is 
different for each oscillator. 
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The Thrust version for the coupled phase oscillator chain looks very similar to 
the Lorenz attractor example. Again, a zip iterator is used to pack the required 
components and to process the resulting sequence with a single sweep of the for.each 
algorithm. The only difference here is that values of neighboring vector elements are 
needed. In order to provide the values, we use Thrust's permutation iterator, so that 
operatorQ of the system function object becomes 

thrust :: for_each( 

thrust :: make_zip_iterator ( 
thrust :: make_tuple( 
x. begin (), 

thrust :: make_permutation_iterator( x.begin(), prev.begin() ), 
thrust :: make_permutation_iterator( x.begin(), next.bogin() ), 
omega. begin () , dxdt.bcgin() ) ), 
thrust :: make_zip .iterator ( 
thrust :: make_tuple( 
x.end(), 

thrust :: make_permutation_iterator( x.bcginQ, prev.endQ ), 
thrust :: make_permutation_itcrator( x.beginQ, next.endQ ), 
omega.end(), dxdt.end() ) ), 
sys_functor () 



prev and next are vectors of type thrust :: device_vector<size_t> and hold the indices to 
the left and right vector elements. 

The stencil operator in MTL4 is a minimalistic matrix type. Its application is 
expressed by a matrix- vector product that is assigned to, or is used to either increment 
or decrement the vector: 

y = S * x; y += S * x; y — = S * x; 

The user must provide a function object that applies the stencil on the i-th element 
of a vector and its neighbors. For the sake of performance the function object has 
to provide two methods: one that is checking indices and to be applied near the 
beginning and the end of the vector and the other without index checking. For the 
considered example, the function object is: 

struct stencihkernel { 

static const int start = — 1, end = 1; 
int n; 

stencihkernel (int n) : n(n) {} 
template <typename Vector> 

__device__ __host__ double operator () (const Vector& v, int i) const { 

return sin(v[i— 1] — v[i]) + sin(v[i] — v[i+l]); 

} 

template <typename Vector> 
__device__ __host__ 

double outer_stencil(const Vector& v, int i, int offset = 0) const { 
double sl= i > offset? sin(v[i — 1] — v[i]) : sin(v[i ]), 

s2= i+1 < n + offset? sin(v[i] - v[i+l]) : sin(v[i ]); 
return si + s2; 
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} 

h 

The parameter offset is needed when vector parts are cached so that the addressing 
is shifted. For the sake of backward (and forward) compatibility the non-portable 
keywords __device__ and __host__ should be replaced by a macro that is defined suitable 
for the according platform, e.g. to an empty string on regular compilers. Thus, the 
user code is entirely platform-independent. 

The stencil function object is used as template argument of the stencil matrix: 

typedef mtl::dense_vector<double> statejype; 

struct phase_oscillators { 
const stateJype& omega; 
mtl : : matrix : : stencillD< stenciLkernel > S ; 

sys_func (const State& omega) : omega(omega), S(num_rows (omega)) {} 

void operator () (const State &x, State &dxdt, double t) const { 

dxdt = S * x; 
dxdt += omega; 

} 

h 

The stencil matrix S in the system function above uses shared memory to benefit from 
re-accessing vector entries and to avoid non-coalesced memory accesses. 

The VexCL version of the example is the most concise variant since VexCL pro- 
vides native support for the user-defined stencils operations. The sum of sines in (3.2) 
is encoded using the vex::StencilOperator<> class template. Template parameters are 
the body string for the generated OpcnCL function encoding required operation, 
width and center point of the stencil, and its return type. Once the stencil operator 
is defined, operatorQ of the function object is implemented with a single line of code: 

typedef vex::vector<double> statejype; 

extern const char oscillator_body[] = "return sin(X[-l] - X[0]) + sin(X[0] - X[l]);"; 

struct phase_oscillators { 
const state-type ω 

vex::StencilOperator< double, 3, 1, oscillator_body > S; 
phase_oscillators (const statejype &w) 

: omega(w), S(w.queueJistQ) { } 
void operator () (const state-type &x, state-type &dxdt, double t) const { 

dxdt = omega + S(x); 

} 

h 

VicnnaCL does not provide stencil operations required for the problem at hand, 
so we had to fallback to the use of a hand-coded OpcnCL kernel, which is omit- 
ted for reasons of brevity. Even though a custom OpcnCL kernel needs to be used 
for this example, ViennaCL provides convenient mechanisms for the compilation and 
launch of custom kernels, avoiding tedious interactions with the low-level OpcnCL 
API. As a consequence, the performance of ViennaCL for this example is compara- 
ble to hand-tuned code. The actual implementation of the phase_oscillators class 
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consequently consists of the compilation of the kernel residing in a string object 
phase_oscillator_source in the constructor, and the launch of the respective kernel in 
operator (): 

typedef viennacl::vector<double> state_type; 

struct phase_oscillators { 
const state_type ω 
sysjunc (const stateJype &w) : omega(w) { 

viennacl :: ocl :: current .context (). add_program(phase_oscillator .source, 
" oscillator .program" ). add .kernel (" oscillator " ); 

} 

void operator () (const state-type &x, state_type &dxdt, double t) const { 
viennacl :: ocl :: kernel &step = viennacl::ocl :: get_kernel( 

" oscillator _program" , " oscillator " ) ; 
viennacl :: ocl :: enqueue( step(static_cast<cl_uint>(x.size()), dxdt, x, omega) ); 

} 

}; 



3.3. Disordered Hamiltonian Lattice. The last example in our performance 
and usage study is a nonlinear disordered Hamiltonian lattice [26]. Its equations of 
motion are governed by 

( 3 - 3 ) '/< .. = Pi,j, Pi,j = -Vi,jQi,j ~ Plij + Xi'l..,- 

Here, AdQij denotes the two-dimensional discrete Laplacian A^qij = + 
Qi.j+i + Qi,j-i ~ ^Qi,j- Such systems are widely used in theoretical physics to study 
phenomena like Anderson localization [33] or thermalization [11]. 

An important property of (3.3) is its Hamiltonian nature. It can be obtained from 
the Hamilton equations and energy as well as phase volume conservation during the 
time evolution can be shown. To account for these properties, a special class of solvers 
exists, namely symplectic solvers. Odeint implements three different variants of such 
solvers, all being of the Runge-Kutta-Nystrom type [1G, 24]. The implementation of 
these solvers requires only the second part of (3.3) with pij to be specified by the 
user. 

The natural choice for the implementation of (3.3) is a sparse matrix-vector prod- 
uct. Since Thrust neither provides sparse matrix types nor sparse matrix- vector prod- 
ucts, Thrust was combined with the CUSPARSE library in order to implement this 
example. CUSPARSE contains a set of basic linear algebra subroutines used for 
handling sparse matrices and is included in the CUDA Toolkit distribution together 
with the Thrust library [1], As an alternative, the open-source Cusp library [6] for 
sparse linear algebra and graph computations on CUDA could have been used, yet 
we consider only the first option in the following. 

For better comparison, all libraries considered in our study (except for MTL4, 
see Sec. 4) use the hybrid ELL format for storing the sparse matrix on GPUs, since 
it is one of the most efficient formats for sparse matrices on these devices [ ]. The 
standard compressed sparse row format is used for CPU runs. As the construction of 
the sparse matrix for —cuf^qi + AdQij is straight-forward, we only provide code for 
the system function object interface. 

The relevant code for the Thrust version of the system function object is 



typedef thrust: :device_vector<double> state_type; 
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void operator () (const state_type &q , state_type &dp) const { 
static double one = 1; 

thrust ::transform(q.begin(), q.end(), dp.begin(), scaled_pow3_functor(— beta) ); 

cusparseDhybmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, 
&one, descr, A, thrust :: raw .pointer _cast(&q[0]), &one, 
thrust : : raw .pointer _cast (&dp [0] ) ) ; 
} 

Here, handle, descr, and A are CUSPARSE data structures holding the CUSPARSE 
context and sparse matrix data. The thrust :: transform() algorithm is used in lines 5 
and 6 to compute the scaled third power of the input vector q. Lines 8-10 call the 
sparse matrix- vector product kernel in CUSPARSE, where thrust :: raw _pointer_cast() 
is used to convert the thrust device vector iterator to a raw device pointer. 
The MTL4 implementation reads: 

typedef mtl::dense_vector<double> state_type; 

void operator () (const state_type& q, state_type& dp) { 
dp = A * q; 

dp — = beta * q * q * q; 
} 

Here A is an instance of a sparse matrix holding the discretization of the linear com- 
bination —ujf jqi + Aciqij- The expression q * q * q computes the triple element-wise 
product of column vector q. Usually, products of column/row vectors among them- 
selves are often program errors and therefore are not allowed in MTL4. Their use 
may be enabled by defining an according macro during compilation. 

The VexCL version employs the user-defined OpenCL function pow3, which com- 
putes the third power of its argument and is used for the sake of best performance.: 

typedef vcx::vector<double> statejtype; 

extern const char pow3_body[] = "return prml * prml * prml;"; 
vex::UserFunction<pow3_body, double(double)> pow3; 

void hamdattice:: operator () (const state_type &q, state_type &dp) const { 

dp = (—beta) * pow3(q) + A * q; 
} 

The ViennaCL version of the system function object is split into two parts: first, 
the sparse matrix- vector product Aq is computed; second, the non-linear term — /3q 3 
is added to the result by means of a custom operation: 

typedef viennacl::vector<double> state-type; 

void hamdattice:: operator () (const state_type &q, state_type &dp) const { 
using namespace viennacl: generator; 

// ... Definition of static symbolic values ... 

static custom_operation lattice_op( 

sym_dp — = sym_beta * element_prod(sym_q, element_prod(sym_q, sym_q)), 
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" hamiltonian" ) ; 

dp = viennacl:: linalg ::prod(m_A, q); 

viennacl :: ocl :: enqueue ( lattice_op (dp, q, m_beta) ); 

} 



4. Results. We present results obtained from our numerical experiments in this 
section. The complete source code for the experiments and the full set of results are 
freely available in a GitHub repository 9 . 

All the libraries tested in this paper allow for the use of both CPU and GPU 
devices. Thrust supports an OpenMP-based execution on the CPU, which is enabled 
by a compilation switch, while OpenCL libraries natively support CPUs provided that 
the respective runtime is installed. Two OpenCL implementations from AMD and 
from Intel were used. The timings provided were obtained for two GPUs, namely an 
NVIDIA Tesla C2070 and an AMD Radeon HD 7970 (Tahiti), as well as for an Intel 
Core i7 930 CPU. All reported values are median values of execution times taken for 
ten runs. 

Figures 4.1 through 4.3 show performance data for the three examples discussed 
in the previous section. The top row in each figure shows performance for CPU-based 
experiments, while bottom rows shows GPU-based data. On the GPU plots the graphs 
for NVIDIA Tesla and AMD Tahiti cards are correspondingly plotted with solid and 
dotted lines. The plots on the left show absolute solution time over the size of the 
problem being solved, while the plots on the right depict performances relative to the 
Thrust version. The only exception to the latter rule is Fig. 4.2, where ViennaCL 
is selected as a reference library for GPU data. This is due to poor performance 
of Thrust library in case of the coupled phase oscillator chain example. Absolute 
execution times for the largest problem size for all of the considered libraries are 
given in Table 4.1. 

4.1. GPU-Performance. In general, all our experiments show up to lOx to 
20 x acceleration when run on a GPU as compared to the CPU path. This is the 
expected acceleration rate for the memory bandwidth bound examples that we looked 
at. However, both CUDA and OpenCL programs show considerable overhead for the 
smaller problem sizes, thus requiring problems of sizes between 10 3 and 10 5 to see 
any significant acceleration on a GPU at all. The overhead for OpenCL libraries is 
larger than that of CUDA programs, which is mostly due to the additional kernel 
management logic required by the OpenCL runtime. 

Performance- wise, all of the considered libraries are close to each other when run 
on a GPU. Thrust performs slightly better in most cases except for the phase oscil- 
lator chain example, where the permutation iterators require the use of an auxiliary 
index vector. MTL4, VexCL and ViennaCL are in general slower than Thrust by 
a few percent, which is usually negligible in practice. Apparently, the CUSPARSE 
implementation of the sparse matrix- vector product is more efficient than that of the 
OpenCL libraries, since VexCL and ViennaCL are slower than Thrust/CUSPARSE 
combination by about 30 percent for the disordered Hamiltonian lattice experiment. 
Both OpenCL libraries show very similar results on a GPU. 

In contrast to the other libraries, CMTL4 uses a compressed row storage matrix 
(CRS) which is known for being a sub-optimal storage format on GPUs because 



9 https:/ /github.com/ddemidov/gpgpu_with_modern_cpp 
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Fig. 4.1. Lorenz attractor ensemble results. 



it causes many non-coalesced memory accesses when the matrix is processed row- 
wise. However, CMTL4 provides a specialized implementation for a matrix-vector 
product when all rows of the matrix have the same number of non-zero entries (or 
are accordingly padded with zeros). In this case, it is much easier to use shared 
memory for minimizing non-coalesced global memory access. This is subject to a 
future publication. This specialized CRS implementation in CMTL4 is twice faster 
than Thrust/CUSPARSE for very small examples and up to 50% slower for large 
examples — which is still significantly better than reported by Kraus and Forster. 
They observed that similar matrices are about four times faster in ELLPACK format 
than CRS [21]. 

ViennaCL is about 15 percent faster than Thrust in the phase oscillator chain 
example due to the use of a hand-written kernel. MTL4 provides similar performance 
with an implementation for arbitrary user-defined ID stencils. 

Conversely, the overhead of using high-level libraries is negligible compared to the 
effort spent in getting familiar with the details of CUDA or OpenCL. Thus, we have 
successfully countered productivity concerns for GPU computing raised in the past 
[8]. 
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Fig. 4.2. Coupled phase oscillator chain results. 



4.2. CPU-Performance. Looking at the performance of the experiments on 
the CPU, differences between the libraries become more pronounced. For larger 
problems Thrust outperforms VexCL by 15 to 40 percent and ViennaCL by about 
75 percent. The only exception is the phase oscillators example, where ViennaCL 
is the fastest library. The difference between the OpenCL implementations of AMD 
and the Intel is negligible in most cases. The only exception is the example of the 
chain of phase oscillators, where the implementation by Intel outperforms the one of 
AMD by up to 50 percent. This might be explained by a better implementation of 
trigonometric functions in Intel's version, producing better code for the underlying 
Intel CPU, for which there are no optimizations by AMD. 

The difference between Thrust and the OpenCL libraries can be explained by 
the fact that Thrust uses an OpenMP back end when run on a CPU. Thus, it does 
not have any overhead such as OpenCL initialization and kernel compilation. The 
gap between VexCL and ViennaCL may be attributed to the different work-group 
sizes used in their kernels and the fact that VexCL generates slightly different kernels 
when run on a CPU device. Moreover, both of the libraries are primarily aimed 
at GPU performance and do not use CPU-specific optimizations, such as employing 
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VexCL CPU (Intel) 
-° — Thrust Tesla 
-a — MTL4 Tesla 
— ViennaCL Tesla 




Fig. 4.3. Disordered Hamiltonian lattice results. 



Table 4.1 

Absolute run times (sec) for the largest problem size. 
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CMTL4 
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CPU 


2 336.14 


5 182.55 


N/A 


VexCL 
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ViennaCL 
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4 073.66 


4 612.23 


5 480.03 


ViennaCL 


CPU (Intel) 


3 943.59 


3 040.05 


5 437.47 
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(a) Lorenz attractor ensemble. 
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(b) Coupled phase oscillator 
chain. 
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(c) Disordered Hamiltonian lat- 
tice. 



Fig. 4.4. VexCL scaling with multigpu computation. 

OpenCL vector data types that facilitate the use of SSE instructions. Still, library- 
specifics aside, the overhead of OpenCL for small problem sizes is tremendous, if 
not embarrassing, hence OpenCL cannot be considered to be a competitive CPU 
programming model for a large area of applications in its present state. 

Finally, results for multi-GPU usage as provided by VexCL in a transparent way 
are considered. Fig. 4.4 shows scaling results for up to three CPUs. It can be seen 
that a notable speed-up for several Tesla GPUs over a single card is only obtained for 
problem sizes larger than 10 6 . It seems that AMD's OpenCL implementation does 
not work very well with multiple GPUs employed. Still, the combined memory of 
several GPUs allows to solve proportionally larger problems on the same system. 

5. Conclusion. Performance- wise, there is almost no difference between various 
platforms and libraries when those are run on the same hardware for large problem 
sizes. As we have shown, various computational problems may be solved effectively in 
terms of both human and machine time with the help of modern high-level libraries. 
Hence, the differences in the programming interfaces of the libraries are more likely 
to determine the choice of a particular library for a specific application rather than 
raw performance. 

The focus of Thrust is more on providing low-level primitives with an interface 
very close to the C++ STL library. Special purpose functionality is available via 
separate libraries such as CUSPARSE and can be integrated without a lot of effort. 
The rest of the libraries we looked at demonstrated that they are able to provide a 
more convenient interface for a scientific programmer than a direct implementation 
in CUDA or OpenCL. MTL4 and VexCL have a richer set of element- wise vector 
operations and allow for the shortest implementations in the context of the ODEs 
considered in this work. VicnnaCL required a few additional lines of code including 
a small custom OpenCL kernel for one of our examples. Still, this extra effort is 
acceptable considering that the library's focus is on sparse linear systems solvers, 
which are, however, beyond the scope of this paper. 

Regarding a comparison of CUDA versus OpenCL, the main difference observed 
in this work is the wider range of hardware supported by OpenCL. Although the 
performance obtained via CUDA is a few percent better than that of OpenCL on 
the overall, the differences are mostly too small in order to make a decision in favor 
of CUDA based on performance only. Moreover, the slight performance advantage 
of CUDA can still turn into a disadvantage when taking the larger set of hardware 
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supporting OpenCL into consideration. 

Another aspect that has not been studied in this work is the ability to generate 
kernels optimized for the problem at hand at runtime. This allows, for example, to 
generate optimized kernels for a certain set of parameters supplied, eliminating any 
otherwise spurious reads from global memory. An in-depth study of such an approach 
is, however, left for future work. 
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